Overview: Correlation
A correlation coefficient is a single number that indicates the strength and direction of a linear relationship between two variables. A correlation matrix gives the correlation coefficients between all pairs of variables (here, the selected fields of a dataset), and is always symmetrical. Expressed as percentages, these coefficients always lie between −100% and +100%, indicating the following:
- +100% = perfect positive correlation,
- 0% = no linear correlation at all, and
- −100% = perfect negative correlation.
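The scale above can be illustrated with a minimal sketch, assuming the standard Pearson coefficient scaled to a percentage; the page does not state which formula the operation uses internally. The function names, field names, and values below are hypothetical and chosen only for illustration.

```python
import math

def pearson_percent(x, y):
    """Pearson correlation of two equal-length sequences, as a percentage."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a field that never changes has zero variance, so this
    # division fails; the coefficient is undefined in that case.
    return 100.0 * cov / math.sqrt(var_x * var_y)

def correlation_matrix(fields):
    """Build a symmetric matrix of percentages from a dict of name -> values."""
    names = list(fields)
    return {a: {b: pearson_percent(fields[a], fields[b]) for b in names}
            for a in names}

# Hypothetical field names and values, for illustration only.
fields = {
    "flow": [1.0, 2.0, 3.0, 4.0],
    "pressure": [2.0, 4.0, 6.0, 8.0],  # rises exactly in step with flow
    "level": [8.0, 6.0, 4.0, 2.0],     # falls exactly as flow rises
}
matrix = correlation_matrix(fields)
```

In this sketch `matrix["flow"]["pressure"]` sits at the +100% end of the scale and `matrix["flow"]["level"]` at the −100% end, and each entry equals its mirror entry across the diagonal, which is why the matrix is symmetrical.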
A high correlation does not necessarily indicate that one variable depends on the other; both may depend on a third, unstated variable. It is therefore important to know your process and to understand your dataset.
Use this operation to create a correlation matrix of selected fields. The correlation matrix can be created using either the row index or a timestamp as the basis for the correlation calculation.
Index based:
- The correlation is calculated over the number of rows in the dataset.
- This is used for discrete and batch processes.
- Where the sampling rate is constant, so that the time span between successive timestamps is the same throughout, the index based calculation gives the same result as the time based calculation.
Time based:
- The correlation calculation takes into account the amount of time a data point remains at a value.
- The difference between successive timestamp values is used for the calculation.
- Time based correlation is generally used for continuous processes.
- Where the sampling rate is constant, so that the time span between successive timestamps is the same throughout, the time based calculation gives the same result as calculating the correlation over the number of rows in the dataset.
- Where a data value remains constant over a number of timestamps, the calculation takes this into account and determines the correlation of the fields based on the duration for which the value remains constant.
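The time based behaviour described above can be sketched as a duration-weighted Pearson correlation. This is an illustration under stated assumptions, not the operation's documented algorithm: it assumes each row's value is held until the next timestamp, that the final row is dropped because its duration is unknown, and that timestamps are whole seconds. All names and values are hypothetical.

```python
import math

def weighted_pearson(x, y, weights):
    """Pearson correlation in which each row counts in proportion to its weight."""
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    cov = sum(w * (a - mean_x) * (b - mean_y)
              for w, a, b in zip(weights, x, y))
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x))
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, y))
    return cov / math.sqrt(var_x * var_y)

def time_based_correlation(timestamps, x, y):
    """Weight each row by the time until the next timestamp.

    Assumption: a value holds until the next row, so the last row has
    no known duration and is left out of the calculation.
    """
    durations = [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]
    return weighted_pearson(x[:-1], y[:-1], durations)

# Hypothetical irregularly sampled data (timestamps in whole seconds).
timestamps = [0, 3, 4, 6]
x = [1.0, 2.0, 4.0, 4.0]
y = [2.0, 1.0, 5.0, 5.0]
r_time = time_based_correlation(timestamps, x, y)

# The same signal written out once per second, with equal weights.
expanded_x = [1.0, 1.0, 1.0, 2.0, 4.0, 4.0]
expanded_y = [2.0, 2.0, 2.0, 1.0, 5.0, 5.0]
r_regular = weighted_pearson(expanded_x, expanded_y, [1.0] * 6)
```

Under these assumptions `r_time` and `r_regular` agree to within floating-point rounding: a value held for three seconds contributes as much as three one-second rows, which is also why constant sampling gives the same result as a row-count calculation.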
This operation will create a new dataset, containing the correlation matrix.
Properties
Category: | Transform |
Performance risk: | High potential performance risk. The performance is influenced by the size of the dataset, which is determined by the number of rows and/or the number of columns. |
Knowledge required: | Working knowledge of the software. |
Effect on datasets
How many datasets are required to perform this operation? | One |
Does it create a new dataset? | Yes |
Can you reconfigure this operation? | Yes |
Can you apply this operation to a locked dataset? | Yes |
Does it modify the current dataset in any way? | No |
Requirements
- A dataset with at least two selected fields with double and/or integer values.
- One timestamp field if a time based correlation is being calculated.
Note: This operation does not support milliseconds. First use the Resample operation to remove milliseconds present in your data source.
Results
- A correlation matrix showing the relationship between the fields of the dataset.
- The correlation is displayed as a percentage, and is either a negative or positive value, indicating the type of correlation relationship between the fields.
- Rows containing empty or null values cannot be used in the correlation calculations. This operation will only use the rows that contain valid values in the specified fields for the correlation calculations.
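The row filtering described in the last point can be sketched as follows. This is a minimal illustration, assuming rows are held as dictionaries and that a floating-point NaN counts as an empty value alongside `None`; the function and field names are hypothetical and not part of the software.

```python
import math

def valid_rows(rows, fields):
    """Keep only rows where every specified field holds a usable value."""
    def usable(value):
        # Assumption: None and NaN both count as "empty or null".
        if value is None:
            return False
        return not (isinstance(value, float) and math.isnan(value))
    return [row for row in rows if all(usable(row.get(f)) for f in fields)]

# Hypothetical rows, for illustration only.
rows = [
    {"flow": 1.0, "pressure": 2.0},
    {"flow": None, "pressure": 4.0},          # dropped: flow is null
    {"flow": 3.0, "pressure": float("nan")},  # dropped: pressure is empty
    {"flow": 4.0, "pressure": 8.0},
]
kept = valid_rows(rows, ["flow", "pressure"])
```

Only the first and last rows remain in `kept`, so a correlation computed from it uses fewer rows than the dataset contains. It is worth checking how many rows survive before relying on the result.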